Computational Statistics
2024-08-26
Important
Before Wednesday, listen to the full conversation of Not So Standard Deviations - Compromised Shoe Situation.
Important
I need your GitHub user name - please email it to me.
By the end of the course, you will be able to…
Example of how data and algorithms are used to make decisions.
http://algorithms-tour.stitchfix.com/
In 2013, DiGrazia et al. published a provocative paper suggesting that polling could now be replaced by analyzing social media data. They analyzed 406 competitive US congressional races using over 3.5 billion tweets. In an article in The Washington Post one of the co-authors, Rojas, writes: “Anyone with programming skills can write a program that will harvest tweets, sort them for content and analyze the results. This can be done with nothing more than a laptop computer.” (Rojas, 2013)
Spend a few minutes reading the Rojas editorial. Be sure to consider Figure 1 carefully, and address the following questions.
Discuss Figure 1 with your neighbor. What is its purpose? What does it convey? Think critically about this data visualization. What would you do differently?
How would you improve the plot? I.e., annotate it to make it more convincing / communicative? Does it need enhancement?
Do you think the study holds water? Why or why not? What are the shortcomings of this study?
Imagine that your boss, who does not have advanced technical skills or knowledge, asked you to reproduce the study you just read. Discuss the following with your neighbor.
What steps are necessary to reproduce this study? Be as specific as you can! Try to list the subtasks that you would have to perform.
What computational tools would you use for each task?
Identify all the steps necessary to conduct the study. Could you do it given your current abilities & knowledge? What about the practical considerations?
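One possible subtask from the list above, "sort them for content and analyze the results", can be sketched in a few lines. This is a minimal illustration and not the authors' method: the tweets and the candidate names `Smith` and `Jones` are made up, and the helpers `count_mentions` and `tweet_share` are hypothetical names. It assumes a tweet counts as a mention whenever the candidate's name appears anywhere in the text.

```python
from collections import Counter

# Hypothetical toy data: in a real reproduction these would be harvested tweets.
tweets = [
    "Smith gave a great speech tonight",
    "Not impressed by Jones at the debate",
    "Voting for smith on Tuesday",
    "Jones and Smith both dodged the question",
]
candidates = ["Smith", "Jones"]  # hypothetical candidate names


def count_mentions(tweets, candidates):
    """Count how many tweets mention each candidate (case-insensitive substring match)."""
    counts = Counter({name: 0 for name in candidates})
    for text in tweets:
        lowered = text.lower()
        for name in candidates:
            if name.lower() in lowered:
                counts[name] += 1
    return counts


def tweet_share(counts, name):
    """Share of all candidate mentions that belong to `name`."""
    total = sum(counts.values())
    return counts[name] / total if total else float("nan")


counts = count_mentions(tweets, candidates)
print(counts)
print(tweet_share(counts, "Smith"))
```

Note that the second toy tweet is critical of Jones yet still counts as a mention, and that harvesting the tweets and obtaining the election results are separate subtasks not shown here.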
Cheap
Can measure any political race (not just the wealthy ones).
Is it really reflective of the voting populace? Who would it bias toward?
Does simple mention of a candidate always reflect voting patterns? When wouldn’t it?
Margin of error of 2.7%. How is that number typically calculated in a poll? Note: for a poll of 1,000 people, \(2 \cdot \sqrt{(1/2)(1/2)/1000} = 0.0316\).
Tweets feel more free in terms of what you are able to say - is that a good thing or a bad thing with respect to polling?
Can’t measure any demographic information.
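The margin-of-error calculation in the note above can be checked numerically. The sketch below assumes a simple random sample, the worst-case proportion p = 1/2, and a multiplier of 2 (roughly 95% confidence). The function names `margin_of_error` and `sample_size_for` are hypothetical, and nothing here claims that the 2.7% figure was actually computed this way.

```python
import math


def margin_of_error(n, p=0.5, z=2.0):
    """Approximate margin of error for a proportion from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


def sample_size_for(moe, p=0.5, z=2.0):
    """Smallest n whose margin of error is at most `moe` under the same assumptions."""
    return math.ceil(z**2 * p * (1 - p) / moe**2)


# Reproduces the value in the note for a poll of 1,000 people.
print(round(margin_of_error(1000), 4))  # 0.0316

# Sample size a traditional poll would need for a 2.7% margin of error.
print(sample_size_for(0.027))  # 1372
```

Under these assumptions, a 2.7% margin of error corresponds to a poll of about 1,372 respondents.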
Gelman: look only at close races
Gelman: “It might make sense to flip it around and predict twitter mentions given candidate popularity. That is, rotate the graph 90 degrees, and see how much variation there is in tweet shares for elections of different degrees of closeness.”
Gelman: “And scale the size of each dot to the total number of tweets for the two candidates in the election.”
Gelman: Make the data publicly available so that others can try to reproduce the results
https://statmodeling.stat.columbia.edu/2013/04/24/the-tweets-votes-curve/
We use tools to do the things. But the tools are not the things.
What does it mean for a data analysis to be “reproducible”?
Short-term goals:
Long-term goals:
Packages: Fundamental units of reproducible R code, including reusable R functions, the documentation that describes how to use them, and sample data
As of August 26, 2024, there are 21,145 R packages available on CRAN (the Comprehensive R Archive Network)
We’re going to work with a small (but important) subset of these!
On Render, the analysis is run from the beginning
Important
The environment of your Quarto document is separate from the Console!
Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!
GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better
We will use GitHub as a platform for web hosting and collaboration (and as our course management system!)
Important
Before next Tuesday, read: Tufte. 1997. Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)
In May 2015, Science retracted a study, published just 5 months prior, of how canvassers can sway people’s opinions about gay marriage.
Science Editor-in-Chief Marcia McNutt:
Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.
Methods we’ll discuss can’t prevent this, but they can make it easier to discover issues.
Source: http://news.sciencemag.org/policy/2015/05/science-retracts-gay-marriage-paper-without-lead-author-s-consent
“The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.”
Source: http://retractionwatch.com/2013/02/01/seizure-study-retracted-after-authors-realize-data-got-terribly-mixed/
The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].
Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].
Source: http://retractionwatch.com/2014/07/01/bad-spreadsheet-merge-kills-depression-paper-quick-fix-resurrects-it/
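A scripted merge can refuse to proceed when identification codes do not line up, which is the kind of error described in the two retractions above. The following is a minimal sketch with made-up values. The helper `checked_merge` and the column names `id`, `il6`, and `depressed` are hypothetical and are not taken from either paper.

```python
def checked_merge(lab_rows, survey_rows, key="id"):
    """Inner-join two lists of dicts on `key`, raising an error instead of
    merging silently when keys are duplicated or appear in only one table."""

    def index(rows, label):
        by_key = {}
        for row in rows:
            k = row[key]
            if k in by_key:
                raise ValueError(f"duplicate {key}={k!r} in {label}")
            by_key[k] = row
        return by_key

    lab = index(lab_rows, "lab data")
    survey = index(survey_rows, "survey data")

    # Keys that appear in exactly one of the two tables.
    unmatched = sorted(set(lab) ^ set(survey))
    if unmatched:
        raise ValueError(f"ids present in only one table: {unmatched}")

    return [{**lab[k], **survey[k]} for k in sorted(lab)]


# Hypothetical toy data; the rows are deliberately in a different order.
lab_rows = [{"id": 1, "il6": 2.1}, {"id": 2, "il6": 3.4}]
survey_rows = [{"id": 2, "depressed": True}, {"id": 1, "depressed": False}]

merged = checked_merge(lab_rows, survey_rows)
print(merged)
```

Because rows are matched on the key and not on their position, the differing row order is harmless here, whereas pasting columns side by side in a spreadsheet would have paired the wrong records. A check like this cannot catch codes that are wrong but still match, so it reduces the risk without removing it.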
Scriptability → R [in contrast to pull-down menus]
Literate programming → R Markdown [in contrast to multiple files]
Version control → Git / GitHub [in contrast to multiple versions]
“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
R: think “Python”
RStudio: think “Jupyter notebook” or “Google Colab”
Taken from Modern Dive: An introduction to statistical and data sciences via R, by Ismay and Kim
Jessica Ward, PhD student at Newcastle University
On GitHub (on the web) edit the README document and Commit it with a message describing what you did.
Then, in RStudio also edit the README document with a different change.
As you work in teams, you will run into merge conflicts. Learning how to resolve them properly will be very important.
What was Hilary trying to answer in her data collection?
Name two of Hilary’s main hurdles in gathering accurate data.
Which is better: high touch (manual) or low touch (automatic) data collection? Why?
What additional covariates are needed / desired? Any problems with them?
How much data does she need?